Project 2 - Clustering
CS (STAT) 5525
Note: Please read the entire project description before you begin. The goal of this project is to analyze the performance of clustering algorithms on several synthetic and real-world data sets. This will be done in the following steps:
We recommend installing Jupyter using Anaconda as it will also install other regularly used packages for scientific computing and data science. Some pointers to setup Jupyter notebooks on your system:
Visually explore the data sets in the experiments below, and consider the following:
Note: The discussion of this exploration is not required in the report, but this step will help you get ready to answer the questions that follow.
The files for this problem are under Experiment 1 folder. Datasets to be used for experimentation: 2d data, chameleon, elliptical, and vertebrate. Jupyter notebook: cluster analysis.ipynb. In this experiment, you will use different clustering techniques provided by the scikit-learn library package to answer the following questions:
| User | Exorcist | Omen | Star Wars | Jaws |
|---|---|---|---|---|
| Paul | 4 | 5 | 2 | 4 |
| Adel | 1 | 2 | 3 | 4 |
| Kevin | 2 | 3 | 5 | 5 |
| Jessi | 1 | 1 | 3 | 2 |
import io
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import cluster
import warnings
warnings.filterwarnings("ignore")
ratings = [['john',5,5,2,1],['mary',4,5,3,2],['bob',4,4,4,3],['lisa',2,2,4,5],['lee',1,2,3,4],['harry',2,1,5,5]]
titles = ['user','Jaws','Star Wars','Exorcist','Omen']
df = pd.DataFrame(ratings,columns=titles)
df = df.rename(columns=lambda x: x.strip())
df = df.set_index('user')
df
| Jaws | Star Wars | Exorcist | Omen | |
|---|---|---|---|---|
| user | ||||
| john | 5 | 5 | 2 | 1 |
| mary | 4 | 5 | 3 | 2 |
| bob | 4 | 4 | 4 | 3 |
| lisa | 2 | 2 | 4 | 5 |
| lee | 1 | 2 | 3 | 4 |
| harry | 2 | 1 | 5 | 5 |
import pandas as pd
ratings = [['john',5,5,2,1],['mary',4,5,3,2],['bob',4,4,4,3],['lisa',2,2,4,5],['lee',1,2,3,4],['harry',2,1,5,5]]
titles = ['user','Jaws','Star Wars','Exorcist','Omen']
df = pd.DataFrame(ratings,columns=titles)
df = df.rename(columns=lambda x: x.strip())
df = df.set_index('user')
new_data = [['Paul',4,2,4,5], ['Adel',4,3,1,2], ['Kevin',5,5,2,3], ['Jessi',2,3,1,1]]
new_df = pd.DataFrame(new_data, columns=titles)
new_df = new_df.rename(columns=lambda x: x.strip())
new_df = new_df.set_index('user')
df = pd.concat([df, new_df])  # DataFrame.append was removed in pandas 2.0
df
| Jaws | Star Wars | Exorcist | Omen | |
|---|---|---|---|---|
| user | ||||
| john | 5 | 5 | 2 | 1 |
| mary | 4 | 5 | 3 | 2 |
| bob | 4 | 4 | 4 | 3 |
| lisa | 2 | 2 | 4 | 5 |
| lee | 1 | 2 | 3 | 4 |
| harry | 2 | 1 | 5 | 5 |
| Paul | 4 | 2 | 4 | 5 |
| Adel | 4 | 3 | 1 | 2 |
| Kevin | 5 | 5 | 2 | 3 |
| Jessi | 2 | 3 | 1 | 1 |
df.columns
Index(['Jaws', 'Star Wars', 'Exorcist', 'Omen'], dtype='object')
k_means = cluster.KMeans(n_clusters=2, max_iter=100, random_state=2)
k_means.fit(df)
labels = k_means.labels_
labels_df = pd.DataFrame(labels, index=df.index, columns=['Cluster ID'])
# K-means cluster IDs are arbitrary; flip 0/1 so the labels present consistently
labels_inverted = 1 - labels
labels_df_inverted = pd.DataFrame(labels_inverted, index=df.index, columns=['Cluster ID'])
print(labels_df_inverted)
       Cluster ID
user
john            1
mary            1
bob             1
lisa            0
lee             0
harry           0
Paul            0
Adel            1
Kevin           1
Jessi           1
k_values = [1, 2, 3, 4, 5, 6]
sse_values = []
for k in k_values:
    k_means = cluster.KMeans(n_clusters=k)
    k_means.fit(df)
    sse_values.append(k_means.inertia_)
plt.plot(k_values, sse_values, 'bo-')
plt.xlabel('Number of clusters (K)')
plt.ylabel('SSE')
plt.show()
We chose k=2 clusters based on the elbow plot, which tracks the SSE (the sum of squared distances between each data point and its assigned centroid) as k grows. SSE always decreases as k increases, so we look for the "elbow": the point of diminishing returns beyond which adding more clusters no longer improves clustering quality significantly.
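The elbow heuristic can be cross-checked with the silhouette score, which rewards tight, well-separated clusters. A minimal sketch on synthetic blobs (the blob centers and parameters below are illustrative, not from the project data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs, so the score should peak at k=3.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 7):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # → 3
```

Unlike SSE, the silhouette score does not decrease monotonically with k, so its maximum can be read off directly rather than eyeballing an elbow.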
df = pd.read_csv('Experiment1/Dataset/vertebrate.csv')
df.columns
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster import hierarchy
data = pd.read_csv('Experiment1/Dataset/vertebrate.csv',header='infer')
names = data['Name']
Y = data['Class']
X = data.drop(['Name','Class'],axis=1)
Yarr = pd.factorize(Y)[0].reshape(-1,1)
Ydist = pdist(Yarr, metric='hamming')
Z = hierarchy.linkage(X.to_numpy(), 'single')
c, Zdist = hierarchy.cophenet(Z,Ydist)
print ("single link",c)
Z = hierarchy.linkage(X.to_numpy(), 'complete')
c, Zdist = hierarchy.cophenet(Z,Ydist)
print("\n complete link",c)
Z = hierarchy.linkage(X.to_numpy(), 'average')
c, Zdist = hierarchy.cophenet(Z,Ydist)
print ("\n group average",c)
single link 0.35580411323343614
complete link 0.6063706366458652
group average 0.4886522572675798
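The cophenetic correlation reported above measures how well the dendrogram's merge heights agree with the reference distances (here, Hamming distances between class labels). A small self-contained sketch with made-up 1-D points (the data is illustrative only):

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

# Two tight groups of 1-D points; reference labels mark group membership.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([[0], [0], [0], [1], [1], [1]])

ref_dist = pdist(labels, metric='hamming')  # 0 within a group, 1 across groups
Z = hierarchy.linkage(X, 'complete')
c, coph_dists = hierarchy.cophenet(Z, ref_dist)  # correlation with reference
print(round(c, 3))
```

Because the dendrogram merges points within each group at small heights and joins the two groups only at a large height, the cophenetic distances track the 0/1 reference distances almost perfectly and c is close to 1.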
df = pd.read_csv('Experiment1/Dataset/chameleon.data',delimiter=' ', names = ['x','y'])
df
| x | y | |
|---|---|---|
| 0 | 650.914 | 214.888 |
| 1 | 41.767 | 179.408 |
| 2 | 509.126 | 233.749 |
| 3 | 486.403 | 152.427 |
| 4 | 46.883 | 367.904 |
| ... | ... | ... |
| 1966 | 631.430 | 210.478 |
| 1967 | 187.652 | 247.923 |
| 1968 | 124.996 | 264.847 |
| 1969 | 522.511 | 302.785 |
| 1970 | 350.695 | 269.386 |
1971 rows × 2 columns
df.plot.scatter(x='x',y='y')
<AxesSubplot: xlabel='x', ylabel='y'>
from sklearn.cluster import DBSCAN
# number of distinct DBSCAN labels per min_samples value
# (note: the noise label -1 counts as one "cluster" here)
cluster_counts = {i: len(np.unique(DBSCAN(eps=15.5, min_samples=i).fit(df).labels_))
                  for i in range(1, 6)}
plt.figure(figsize=(8, 6))
plt.plot(cluster_counts.keys(), cluster_counts.values())
plt.xlabel("Min Samples")
plt.ylabel("Number of Clusters")
plt.title("DBSCAN")
plt.show()
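A common heuristic for choosing eps (not part of the notebook above) is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for a knee. A sketch on synthetic points standing in for the chameleon data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
pts = rng.uniform(0, 100, size=(500, 2))  # stand-in for df[['x', 'y']]

k = 4  # rule of thumb: k = min_samples
# kneighbors on the training set returns the query point itself as neighbor 0,
# so the last column is the distance to the (k-1)-th nearest *other* point.
nn = NearestNeighbors(n_neighbors=k).fit(pts)
dists, _ = nn.kneighbors(pts)
kth = np.sort(dists[:, -1])

plt.plot(kth)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to neighbor {k - 1}')
plt.title('k-distance plot for choosing eps')
plt.show()
```

The eps value at the knee of this curve separates "core-dense" points from sparse ones; points to the right of the knee tend to become noise.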
for i in range(1, 6):
    db = DBSCAN(eps=15.5, min_samples=i).fit(df)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = pd.DataFrame(db.labels_, columns=['Cluster ID'])
    result = pd.concat((df, labels), axis=1)
    n_clusters = labels.max()[0] + 1
    plt.figure(figsize=(8, 6))
    plt.scatter(x='x', y='y', c='Cluster ID', cmap='jet', data=result)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title(f'DBSCAN Clustering - Min Samples: {i}, Number of Clusters: {n_clusters}')
    plt.show()
from sklearn.cluster import KMeans
data1 = pd.read_csv('Experiment1/Dataset/2d_data.txt', delimiter=' ', names=['x', 'y'])
data2 = pd.read_csv('Experiment1/Dataset/elliptical.txt', delimiter=' ', names=['x', 'y'])
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
data1.plot.scatter(x='x', y='y', ax=axes[0])
axes[0].set_title('Dataset 1')
data2.plot.scatter(x='x', y='y', ax=axes[1])
axes[1].set_title('Dataset 2')
kmeans = KMeans(n_clusters=10, max_iter=50, random_state=1)
kmeans.fit(data1)
labels1 = pd.DataFrame(kmeans.labels_, columns=['Cluster ID'])
result1 = pd.concat((data1, labels1), axis=1)
kmeans.fit(data2)
labels2 = pd.DataFrame(kmeans.labels_, columns=['Cluster ID'])
result2 = pd.concat((data2, labels2), axis=1)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
result1.plot.scatter(x='x', y='y', c='Cluster ID', ax=axes[0])
axes[0].set_title('K-Means Clustering - Dataset 1')
result2.plot.scatter(x='x', y='y', c='Cluster ID', ax=axes[1])
axes[1].set_title('K-Means Clustering - Dataset 2')
print(f"Number of clusters in data 1: {labels1.max()[0] + 1}")
print(f"Number of clusters in data 2: {labels2.max()[0] + 1}")
Number of clusters in data 1: 10
Number of clusters in data 2: 10
The files for this problem are under Experiment 2 folder. Datasets to be used for experimentation are : samsung test labels, samsung train labels, samsung train, samsung test. Jupyter notebook: pca and clustering.ipynb. The data comes from the accelerometers and gyros of Samsung Galaxy S3 mobile phones (https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones).
In this data, the type of activity a person was performing with a phone in their pocket is also known - whether they were walking, standing, lying down, sitting, walking up or walking down the stairs. Answer the following questions:
For example, if a cluster contains 300 data points, of which 200 belong to the majority class, then the purity metric for this cluster will be 200/300, which is approximately 0.67. A higher value of this metric for a cluster signifies higher purity of the cluster. Compute this metric for all of the 6 clusters produced by running K-means with K = 6 on the given dataset. What is the maximum purity metric across all 6 clusters?
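Per-cluster purity is simply the largest class count in a cluster divided by the cluster size. A toy check reproducing the 200/300 example (the labels below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical assignments: one cluster of 300 points,
# 200 of which share the majority class label.
true_labels = np.array([0] * 200 + [1] * 100)
cluster_ids = np.zeros(300, dtype=int)  # every point in cluster 0

tab = pd.crosstab(true_labels, cluster_ids)  # class counts per cluster
purity = tab.max() / tab.sum()  # per cluster: majority count / cluster size
print(round(purity[0], 2))  # → 0.67
```

The same `tab.max() / tab.sum()` pattern is what the notebook applies to the full 6-class crosstab below.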
import os
from sklearn.preprocessing import StandardScaler
X_train = np.loadtxt(os.path.join("Experiment2/", "samsung/", "samsung_train.txt"))
y_train = np.loadtxt(os.path.join("Experiment2", "samsung/", "samsung_train_labels.txt")).astype(int)
X_test = np.loadtxt(os.path.join("Experiment2/", "samsung/", "samsung_test.txt"))
y_test = np.loadtxt(os.path.join("Experiment2/", "samsung/", "samsung_test_labels.txt")).astype(int)
X = np.vstack([X_train, X_test])
y = np.hstack([y_train, y_test])
n_classes = np.unique(y).size
print(f"Number of classes are {n_classes}")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
k = 6
kmeans = KMeans(n_clusters=k, n_init=100, random_state=1)
kmeans.fit(X_scaled)
cluster_labels = kmeans.labels_
tab = pd.crosstab(y, cluster_labels, margins=True)
tab.index = ['walking', 'going up the stairs', 'going down the stairs', 'sitting', 'standing', 'lying', 'all']
tab.columns = ['cluster' + str(i + 1) for i in range(k)] + ['all']
tab = tab.drop(['all'])
purity = round(max(tab.max() / tab.sum()), 3)
print(f"maximum purity metric is {purity} with k = {k}")
Number of classes are 6
maximum purity metric is 0.945 with k = 6
X_train = np.loadtxt(os.path.join("Experiment2", "samsung", "samsung_train.txt"))
y_train = np.loadtxt(os.path.join("Experiment2", "samsung", "samsung_train_labels.txt")).astype(int)
X_test = np.loadtxt(os.path.join("Experiment2", "samsung", "samsung_test.txt"))
y_test = np.loadtxt(os.path.join("Experiment2", "samsung", "samsung_test_labels.txt")).astype(int)
X = np.vstack([X_train, X_test])
y = np.hstack([y_train, y_test])
n_classes = np.unique(y).size
# standardize data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# calculate purity for k = 6 to 10
purity_scores = []
for k in range(6, 11):
    kmeans = KMeans(n_clusters=k, n_init=100, random_state=1)
    kmeans.fit(X_scaled)
    cluster_labels = kmeans.labels_
    tab = pd.crosstab(y, cluster_labels, margins=True)
    tab.index = ['walking', 'going up the stairs', 'going down the stairs', 'sitting', 'standing', 'lying', 'all']
    tab.columns = ['cluster' + str(i + 1) for i in range(k)] + ['all']
    tab = tab.drop(['all'])
    purity = max(tab.max() / tab.sum())
    purity_scores.append(round(purity, 3))
# print maximum purity score and purity scores for each k
max_purity = max(purity_scores)
print(f"The maximum purity metric for any cluster is {max_purity} with k = {purity_scores.index(max_purity) + 6}")
print(f"The purity scores for k = 6 to 10 are: {purity_scores}")
# plot purity scores vs. k
pd.Series(purity_scores, index=range(6, 11)).plot()
plt.xlabel("Number of clusters (K)")
plt.ylabel("Purity")
plt.title("Purity vs. K")
plt.show()
The maximum purity metric for any cluster is 0.959 with k = 8
The purity scores for k = 6 to 10 are: [0.945, 0.946, 0.959, 0.959, 0.958]
Purity increases as the data is divided into smaller clusters, but at higher cluster counts more and more outliers form their own small clusters, which leads to the slight drop.
The files for this problem are under Experiment 3 folder. Jupyter notebook: covid-19research-challenge.ipynb. In this experiment, we will be looking at the problem of clustering real-world research articles related to COVID-19. Dataset Download URL: https:// drive.google.com/file/d/1IC0s9QoBLWFN9tRI-z2QbJJWgngfAm8w/view?usp=sharing (Filename: CORD-19-research-challenge.zip, File size: 1.58 GB). Please download and unzip this file in the Experiment 3 folder before running the Python notebook for this problem. Dataset Description: In response to the COVID-19 pandemic, the White House and a coalition of leading research groups have prepared the COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in modern coronavirus literature, making it difficult for the medical research community to keep up. Answer the following questions.
df_covid=pd.read_csv('Experiment3/final_processed_metadata')
df_covid = df_covid[df_covid.abstract != ''] #Remove rows which are missing abstracts
df_covid = df_covid[df_covid.body_text != ''] #Remove rows which are missing body_text
df_covid.drop_duplicates(['abstract', 'body_text'], inplace=True) # remove duplicate rows having same abstract and body_text
df_covid.describe(include='all')
df=df_covid
data=df_covid
df.describe(include='all')
| paper_id | abstract | body_text | authors | title | journal | abstract_summary | abstract_word_count | body_word_count | |
|---|---|---|---|---|---|---|---|---|---|
| count | 24584 | 24583 | 24584 | 24584 | 24584 | 24584 | 24584 | 24584.000000 | 24584.000000 |
| unique | 24584 | 24547 | 24584 | 23709 | 24545 | 3963 | 24545 | NaN | NaN |
| top | 00142f93c18b07350be89e96372d240372437ed9 | travel medicine and infectious disease xxx xxx... | introduction human beings are constantly expos... | Woo, Patrick C. Y.. Lau, Susanna K. P.... | Respiratory Infections | PLoS One | Travel Medicine and Infectious Disease xxx<br... | NaN | NaN |
| freq | 1 | 5 | 1 | 7 | 3 | 1514 | 5 | NaN | NaN |
| mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 216.446673 | 4435.475106 |
| std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 137.065117 | 3657.421423 |
| min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.000000 | 23.000000 |
| 25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 147.000000 | 2711.000000 |
| 50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 200.000000 | 3809.500000 |
| 75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 255.000000 | 5431.000000 |
| max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3694.000000 | 232431.000000 |
column_labels = df.columns
column_labels
Index(['paper_id', 'abstract', 'body_text', 'authors', 'title', 'journal',
'abstract_summary', 'abstract_word_count', 'body_word_count'],
dtype='object')
import numpy as np
words = ["the", "2019", "novel", "coronavirus","sarscov2","identified", "as", "the", "cause"]
n_gram_all = []
# get n-grams for the instance
n_gram = []
for i in range(len(words) - 2 + 1):
    n_gram.append("".join(words[i:i+2]))
n_gram_all.append(n_gram)
n_gram_all
[['the2019', '2019novel', 'novelcoronavirus', 'coronavirussarscov2', 'sarscov2identified', 'identifiedas', 'asthe', 'thecause']]
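The inline loop above generalizes to any n; a small helper function (the name `char_joined_ngrams` is mine, not from the notebook):

```python
def char_joined_ngrams(words, n=2):
    """Join each sliding window of n tokens into one string,
    matching the concatenated 2-grams produced above."""
    return ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = ["the", "2019", "novel", "coronavirus"]
print(char_joined_ngrams(words, 2))
# → ['the2019', '2019novel', 'novelcoronavirus']
```

A list of such token lists can then be fed directly to a `TfidfVectorizer` with `analyzer=lambda l: l`, as the notebook does later.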
Use Euclidean distance to find the best value of k and check using the silhouette score.
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
import seaborn as sns
# create a list to store the Silhouette scores for each value of k
silhouette_scores = []
# loop over different values of k
for k in range(8, 13):
    # create a MiniBatchKMeans object with k clusters
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=42)
    # fit the algorithm to the data
    kmeans.fit(X)
    # calculate the silhouette score and append it to the list
    score = silhouette_score(X, kmeans.labels_)
    silhouette_scores.append(score)
# find the value of k with the highest Silhouette score
best_k = silhouette_scores.index(max(silhouette_scores)) + 8
best_k
11
t-SNE Covid-19 Articles - Clustered k=10
from IPython.display import Image
Image(filename='Experiment3/tsneclusteredk10.png')

t-SNE Covid-19 Articles - Clustered k=9
Image(filename='Experiment3/tsneclusteredk=9.png')

t-SNE Covid-19 Articles - Clustered k=11
Image(filename='Experiment3/tsneclusteredk=11.png')

From the code we find that the best value is k=11 and the second best is k=9. As the plots show, k=11 has less overlap between clusters than k=10 or k=9.
The code below was used to plot these clusters in covid-19research-challenge.ipynb:
from sklearn.cluster import MiniBatchKMeans
k = 9
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# colors
palette = sns.color_palette("bright", len(set(y_pred)))
# plot
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1], hue=y_pred, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered")
# plt.savefig("plots/t-sne_covid19_label.png")
plt.show()
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(lowercase=False,analyzer=lambda l:l, max_features=2**12)
X = vectorizer.fit_transform(n_gram_all)
from sklearn.cluster import MiniBatchKMeans
k = 10
kmeans = MiniBatchKMeans(n_clusters=k)
y_pred = kmeans.fit_predict(X)
from sklearn.manifold import TSNE
tsne = TSNE(verbose=1)
X_embedded = tsne.fit_transform(X.toarray())
from matplotlib import pyplot as plt
import seaborn as sns
# sns settings
sns.set(rc={'figure.figsize':(15,15)})
# colors
palette = sns.color_palette("bright", len(set(y_pred)))  # one color per predicted cluster
# plot
sns.scatterplot(x=X_embedded[:, 0], y=X_embedded[:, 1], hue=y_pred, legend='full', palette=palette)
plt.title("t-SNE Covid-19 Articles - Clustered(K-Means) - Tf-idf with 2-grams")
# plt.savefig("plots/t-sne_covid19_label_TFID.png")
plt.show()
Output for the tf-idf vectorizer with plain text features:
Image(filename='Experiment3/tidfplaintextfeatures.png')

Output for the 2-grams tf-idf vectorizer:
Image(filename='Experiment3/2gramq4oupt.png')
We see that the output for the tf-idf vectorizer with plain text features has better separation.